A path planning approach for mobile robots using short and safe Q-learning

Path planning is a major challenging problem for mobile robots, as the robot is required to reach the target position from the starting position while simultaneously avoiding conflicts with obstacles. This paper refers to a novel method as short and safe Q-learning to alleviate the short and safe path planning task of mobile robots. To solve the slow convergence of Q-learning, the artificial potential field is utilized to avoid random exploration and provides a priori knowledge of the environment for mobile robots. Furthermore, to speed up the convergence of the Q-learning and reduce the computing time, a dynamic reward is proposed to facilitate the mobile robot towards the target point. The experiments are divided into two parts: short and safe path planning. The mobile robot can reach the target with the optimal path length in short path planning, and away from obstacles in safe path planning. Experiments compared with the state-of-the-art algorithm demonstrate the effectiveness and practicality of the proposed approach. Concluded, the path length, computing time and turning angle of SSQL is increased by 2.83%, 23.98% and 7.98% in short path planning, 3.64%, 23.42% and 12.61% in safe path planning compared with classical Q-learning. Furthermore, the SSQL outperforms other optimization algorithms with shorter path length and smaller turning angles.


Introduction
Path planning is the calculation of a feasible path from a start node to a goal node in a map or grid without colliding with obstacles on the way [1]. It requires mobile robots (MRs) to be equipped with sensors, onboard computers, and motion systems to plan and move. Many current algorithms can implement path planning for MR. The former approach includes searchbased planning algorithms: A � [2], D � [3], etc. A � algorithm is one of the popular heuristic algorithms [4], which is widely used in the field of path optimization. Its unique feature is that it introduces global information when checking each possible node in the shortest path, estimates the distance between the current node and the end point. Derived from the traversal A � , D � algorithm repairs the existing search information instead of reconstructing the whole search graph [5,6]. It is suitable for dynamic environment compared to A � . Due to their a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 universality and ease of implementation, these algorithms achieve significant results in searching for paths, but the search time grows exponentially with the resolution size and search depth of the map; Sampling-based planning algorithms: rapidly random-exploring tree (RRT) [7], probabilistic roadmap (PRM) [8], etc. The advantage of these algorithms is that they are effective and fast in high-dimensional path search, and the disadvantage is that these algorithms usually sample the environment for random search to find paths, the results are often not optimal and it is difficult to find a feasible path in environments with narrow passages. Artificial potential field (APF) [9] and BUG algorithm [10] are widely used in path planning for local obstacle avoidance. Although these algorithms are computationally simple for dynamic obstacle environments and fast in path search, the optimal path is often not obtained, and the search path may be erroneous when the obstacles are large in a complex environment. Another type of algorithm is the intelligent algorithms: it is an algorithm that people model by nature-inspired or human mind to imitate solving problems [11]. Typical algorithms are particle swarm optimization (PSO) [12], ant colony optimization (ACO) [13], and other algorithms. Intelligent algorithms play an effective role in solving complex dynamic environments, but there are common problems such as slow computation speed, poor stability, poor realtime performance, and easy to fall into local optimality.
Path planning can be divided into global and local path planning depending on the acquisition of the environment [14]. Global path planning is to plan a path for MR in a completely known environment. The location and shape of obstacles in the environment are known to the MR. The global path planning is often optimal or sub-optimal. As the robot travels along the global path, if there are obstacles in the global path that are not modeled in the known environments, the robot will collide with the obstacles. This requires local path planning in completely unknown or partially known environments. The local planning method integrates the modeling of the environment with the search, which provides real-time feedback and correction of the planning results. However, the planning results may not be optimal due to the lack of global environmental information. The two planning methods need to work together to allow the robot to better plan its travel path from the start to the target position.
To solve global and local path planning problems, a common approach is to hybridize two algorithms [15][16][17][18]. In [19], the authors proposed a method using the A � algorithm for global path planning, and the APF for local path planning in unknown areas. It provides the best solutions in less time when obtaining optimal path distance. [20] proposed an improved ACO for a globally optimal path, and an improved APF is subsequently employed to avoid unknown obstacles during navigation. The authors in [21] proposed an improved PSO for global path planning and APF for local dynamic obstacle avoidance, to solve the local minimum problem. The two algorithms are combined to solve the MR global and local path planning problem, avoiding the defects of a single algorithm that is prone to get trapped in a local minimum [22].
All of the above-mentioned algorithms have their superiorities and limitations. Most of the researches are environment-based precise modeling and positioning navigation. However, the real environments are partially or completely unknown. It requires MR evasion of unknown static and dynamic obstacles. Moreover, most of the papers treat MR as a mass point to plan the path, resulting in the path being too close to the obstacle, which leads the MR to collide with obstacles [23].
To solve the defects of the above path planning method, the proposed approach enables the mobile robot to plan the path in two modes: short and safe path planning. The effectiveness, superiority, and rapidity of the SSQL algorithm are demonstrated through simulation and comparison experiments. The SSQL algorithm proposed in this paper is applied to solve the path planning for MRs with the following main contributions: 1. The SSQL is a novel proposal that can find short and safe condition for MR path planning, outperforming motion planning proposals based on the state-of-the-art algorithms.
2. Two path planning schemes are proposed, one is to plan the shortest path length (Proximity to obstacles), and the other is to plan the safe path length (away from obstacles) from the starting point to the target point.
3. This paper hybridizes Q-learning with APF to initialize Q-table, avoiding the random movement around the start point at the beginning of the algorithm.
4. The proposed algorithm changes the constant reward into a combination of static and dynamic reward; speeds up the convergence of the algorithm and avoids the algorithm falling into a dead-end path blocked by obstacles.
5. By comparing with other algorithms, it is proved that the proposed algorithm in this paper can solve the MR path planning problems for two modes: short and safe path planning.

Materials and methods
In this section, this paper first describes the MR path planning problem, then outlines classical Q-learning, and at last proposes the SSQL algorithm.

Problem formulation
To better illustrate the planning problem using SSQL in this paper, path planning process is shown in Fig 1. The schematic diagram of MR and the definition of safe distance are shown in Fig 1A, where, x a and y a represent the earth-fixed frame, x b and y b represent the body-fixed frame of MR, r represents the radius of the circular MR. It is assumed that the origin of the body-fixed frame is located in the MR centre of mass. To plan the path away from obstacles, this paper defines the MR safety radius as r sd , to avoid collision of the planned path with obstacles. For better obstacle avoidance, this paper defines: when there are no obstacles around MR can move to the adjacent eight directions (north, northeast, east, southeast, south, southwest, west and northwest). MR cannot cross the obstacles barrier diagonally. When the obstacles are located around the MR, the movement directions of MR are shown in Fig 1B. In a Cartesian coordinate, the MR aims to plan a feasible path from the start S (1,4) to the target T (7,4) without collision

PLOS ONE
with obstacles. Fig 1C and 1D show two models of planning path for a MR. Fig 1C is the short path length mode, the MR can reach the target from the start with the short path length. Fig  1D is the safe condition mode, the MR always distance from obstacle 2r sd planning path.

Q-learning algorithm
Q-learning [24] is a value-based reinforcement learning algorithm proposed by Watkins in 1989. MR path planning can be expressed as follows: at each discrete time series (off-line strategy temporal-difference) [25]. The MR interacts with the environment through continuous feedback, generating more data (states and returns) and using the new data to further improve its own behavior. Q-table is the expectation that Q(s t ,a t ) can gain by taking action a t in state s t , updated by the Formula shown in Formula 1.
The above function can also be written as Formula 2 and Formula 3: where, α (0�α�1) is a learning rate parameter, γ (0�γ�1) is a discount rate parameter. In this paper, the Q-learning is applied to MR path planning for the following reasons: 1. Reinforcement learning has good interaction with the environment without the need for positive or negative labels [26]. MR gains current knowledge by exploring and learning from the environment, and improve their operational strategies to adapt to the environment.
2. The Q-learning algorithm is highly exploratory and is an iterative trial-and-error process, where multiple attempts are made for each possible pair of state actions to obtain the optimal policy as long as time allows.
3. Q-learning uses off-policy [27], the selection of actions according to the target strategy can be used to control the distance between MR and the obstacle.

The proposed short and safe Q-learning algorithm
In this section, the proposed SSQL algorithm is elaborated in four stages for solving the short and safe path planning. To better describe the studied content, the MR path planning problem discussed in this paper has the following premises: 1. The MR and the obstacle environment are three-dimensional objects in practice. To simplify the problem, this paper treats their motion space as two-dimensional coordinates.
2. The MR path planning studied in this paper is divided into two parts: short and safe path length. The location, shape and size of obstacles in the environment are known in the path planning.
3. The task of the MR is to reach the target from the start with the shortest and safe path length.

Q-table initialization
Classical Q-learning usually initializes the Q-table to zero or normally distributed random numbers, which leads the MR to choose actions randomly in the exploration phase. Resulting in slow convergence and long computing time of the algorithm. To optimize the problem, this paper combines APF with Q-learning to optimize the initial Q-table. The reasons are: the APF can be combined with the Q-learning both in global and local path planning; it is easy to implement in grid maps and provide prior knowledge of the environment for MR; to speed up convergence and reduce computing time.
The information in the environment (starting point, target point, obstacle shape, size and location coordinates) is completely known for MR, so Coulomb's law is used to model the APF for the grid environment, as shown in Formula 4, Formula 5 and Formula 6: where, U a (s t ) is the gravitational field producing gravitational force in state s t ; U r (s t ) is the gravitational field producing gravitational force in state s t ; U n (s t ) is the total potential energy of state s t ; ρ g (s t ) is the Euclidean distance between state s t and the center of the target point; ρ ob (s t ) is the Euclidean distance between state s t and the center of the obstacle; ρ o is the obstacle influence factor; k a and k r are the scale factors; U(s t ) is the potential energy in state s t ; U max is the highest potential energy in state s t

Action selection
To avoid obstacles and reach the target point with the short path length, eight directions of movement are defined for the MR (north, northeast, east, southeast, south, southwest, west and northwest). When the safe distance of MR is defined as r sd = 5m, the corresponding movement and movement distance are: action1: a 1 = move north 10m; action2: a 2 ¼ move northeast 10 ffi ffi ffi 2 p m; action3; a 3 = move east 10m; action4: a 4 ¼ move southeast 10 ffi ffi ffi 2 p m; action5: a 5 = move south 10m; action6: a 6 ¼ move southwest 10 ffi ffi ffi 2 p m; action7: a 7 = move west 10m; action8: a 8 ¼ move northwest 10 ffi ffi ffi 2 p m

Reward function
To solve this problem, a combined static and dynamic reward function is proposed in the DMQL algorithm, which provides the target and current position as a priori knowledge to the MR. When the MR is closer to the target position, the larger the reward obtained, prompting it to move to the target position and speed up the convergence. In this paper, two different sets of reward functions are set for the two working modes of SSQL. The reward functions of the short path length are shown in Formulas 9~13. The planned path is shown in Fig 1C.
ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi where, r s is the static reward; r d is the dynamic reward; d t is the Manhattan distance from the target position in the state s t ; d t+1 is the Euclidean distance from the target position in the next state s t+1 ; (x t ,y t ) is the coordinate in the state s t ; (x t+1 ,y t+1 ) is the coordinate in the state s t+1 ; (x target ,y target ) is the coordinate in the target position. The reward functions of the safe path length are the same with Formula 9, Formula 11, Formula 12 and Formula 13. Formula 10 is modified as Formula 14. The planned path is shown in Fig 1D. By setting such a reward function the distance to the obstacle can be controlled.

Short and safe condition path planning
In this paper, the proposed SSQL algorithm is applied to the path planning problem of MR, and its basic ideas are as follows: the APF is used to initialize Q-table, to provide the prior knowledge in the environment to the MR. The static and dynamic reward are combined to optimize the reward function, to induce the MR to move toward the target point and control the distance between MR and obstacles.
To sum up, the flow of MR using SSQL algorithm is as follows: when MR is located at the starting point, select the mode (short or safe condition path planning). When considering MRs driving as far away from shoreline objects as possible to avoid collisions with reefs, the safe condition is used. When distance and energy consumption are taken into account, short condition path planning is used to path the shortest path length for MR. The combination of the two models allows for better path planning for MRs based on actual situations.

Results and discussion
It is difficult to apply the SSQL algorithm directly to the path planning problem of MR, and it requires repeated training to obtain the optimal action strategy, therefore, this paper demonstrates the performance and generality of the SSQL through several numerical simulations in this section.

Environments
This paper considers nine different environment maps for MR to validate our approach. Each map is defined as a square grid environment with sides of length 20 � 2r (r is the radius of MR). The simulation maps are represented by M01 to M09 as shown in Fig 2. The center of each

PLOS ONE
grid is marked by a Cartesian coordinate system, with the x-axis indicating the horizontal direction and the y-axis indicating the vertical direction. The coordinates are expressed as (x, y). The first and second dimensions of the grid map represent the x-horizontal and y-vertical coordinates, respectively.

Performance metrics
To test the effectiveness, safety, and speed of the proposed SSQL algorithm in a comprehensive and concrete way, the MR path is evaluated in three performance metrics: path length (Formula 15), turning angle (Formula 19), and computing time.

Path length (m)
ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi where, i = 0,1,2,. . .,n, when i = 0, the MR is at the starting position S = (x o ,y o ), when i =n, the MR is at the target position T = (x n ,y n ), (x i ,y i ) represents the coordinates of the current state of the MR; (x i+1 ,y i+1 ) represents the coordinates of the next state of the MR.

Turning angle (rad)
The turning angle is the sum of the change in heading angle of MR from the start to the target. When the turning angle is smaller, the path is smoother, less energy is consumed, and the completion time of the mission is shorter. This paper defines the turning angle of MR from the start to the target position as shown in Formula 16~Formula 19.
c i ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi AngleðradÞ ¼ where, i = 0,1,2,. . .,n, (x i ,y i ) represents the coordinates of the current state of MR; (x i+1 ,y i+1 ) represents the coordinates of the next state of MR.

Computing time (s)
Computing time is consumed by the algorithm through iterative computation is denoted as Time(s). The average value of 30 repeated tests in this paper is taken as the computing time. The shorter computing time, the less waiting time for MR to execute the action.

Parameters selection
Before the SSQL algorithm is executed, the values of the two parameters: learning rate α (0�α�1) and decay rate γ(0�γ�1) in the Q-value update Formula 1 need to be determined. According to Watkins et al. [24], when the α value is small, the agent goes through all states in the environment and calculates all possible actions, and the Q value converges to the optimal value. When the γ value is large, it can expand the exploration range of the agent and prevent the agent from falling into the problem of local optimum. Therefore, based on the above theory, when γ = 0.9, this paper records the number of iterations that path length converges to the optimum (Fig 2). Repeat the test 30 times to take the average value, considering the computing time of MR, both α and γ are taken as 0.9 [23].
To evaluate the generality and universality of the SSQL algorithm applied to the MR path planning problem, the SSQL algorithm is compared with CQL (Classical Q-learning), PSO (Particle Swarm Optimization), GWO (Grey Wolf Optimization) [28], DA [29] and MFO (Moth-Flame Optimization) [30] in different simulation environments, respectively. PSO [31] is the basic path planning comparison algorithm. It originates from the study of bird flock predation behavior, which is characterized by easy implementation; GWO is inspired by the leadership hierarchy and hunting mechanism of the gray wolf in nature, which has the characteristics of strong convergence performance and few parameters; DA and MFO are the state-of-the-art algorithms. DA is inspired by the static and dynamic swarming behaviors of dragonflies in nature, which has the advantages of strong stability, good search speed, and global searchability, etc. MFO simulates the special navigation mechanism of moths using lateral positioning during night flight, which has the performance characteristics of strong parallel optimization ability, global superiority, and not easy to fall into local minima. To test the SSQL algorithm proposed in this paper under the same conditions, the parameter settings of the compared algorithms are shown in Table 1.
where, Max is the maximumnumber of iterations, Pop is the population size; PSO: c 1 and c 2 are the learning factor, ω is the linearly decreasing weight (LDW) (Tian et al., 2018), ω max is the maximum inertia weight, ω min is the minimum inertia weight; GWO: a is a constant, the initial value is 2, and decreases linearly from 2 to 0 with the iteration of the algorithm, C is a random number between 0 and 2; DA: a is the alignment weight, c is the cohesion weight, e is the natural enemy weight factor, f is the prey weight factor, r is the neighborhood radius, and s is the separation weight; MFO: t is the path coefficient, b is the logarithmic spiral shape constant; rand() denotes a random number between [0,1], and Iter denotes the current iteration number. https://doi.org/10.1371/journal.pone.0275100.t001

Short path planning for mobile robot
In this section, a short path planning for MR using the SSQL is proposed. The proposed algorithm enables the MR to perform the task with the shortest or shorter path length from the starting point to the target point. To verify the effectiveness and generalizability of the proposed algorithm, tests were conducted on the nine grip maps in Fig 3. Each map shows the best path obtained by the SSQL algorithm (black solid line), S is the starting point and T is the target point.

Fig 3. Short path planning (solution) for MR in different test environments (Each map shows the best path obtained by the SSQL algorithm, S is the starting point and T is the target point).
https://doi.org/10.1371/journal.pone.0275100.g003

PLOS ONE
The optimal (suboptimal or optimal under optimal conditions) results of SSQL in 30 repeated tests (the path with the smaller turning angle and computing time under the same path length condition is taken) are shown in Fig 3 (the paths of other compared algorithms are not indicated in Fig 3 for a clearer representation of the paths of SSQL). Experimental data and analysis results were recorded in Table 2, and check whether there is a significant difference between SSQL algorithm and other comparison algorithms by t-test (considering 0.05 significance level). Compared mean path length of various path planning algorithms for MR are shown in Fig 4.

Safe path planning for mobile robot
In this section, a safe planning for MR using the SSQL is proposed. The proposed algorithm enables the MR to perform the task with the safe path length (keep away from obstacles) from the starting point to the target point. To verify the effectiveness and generalizability of the proposed algorithm, tests were conducted on the nine grip maps in Fig 3. Since the path planned by the comparison algorithms (PSO, GWO, DA and MFO) cannot move away from the obstacles, it usually adopts the way of expanding the obstacles to achieve the purpose of planning the safe path. In this paper, the obstacles are expanded by one grid (1x1 expanded to 3x3 grids) to plan the path of the comparison algorithms. Each map shows the best path obtained by the SSQL algorithm (black solid line), S is the starting point and T is the target point.
The optimal (suboptimal or optimal under optimal conditions) results of SSQL in 30 repeated tests (the path with the smaller turning angle and computing time under the same path length condition is taken) are shown in Fig 4 ( Experimental data and analysis results were recorded in Table 3, and check whether there is a significant difference between SSQL algorithm and other comparison algorithms by t-test (considering 0.05 significance level). Compared mean path length of various path planning algorithms for MR are shown in Fig 6. It can be seen that in the nine test maps (Figs 3 and 5), the SSQL can effectively solve the path planning problem of MR from the starting position to the target position without collision with obstacles under short path length condition, and it can achieve excellent results (path length and turning angle) in majority of experiments except M02. The results demonstrate that the proposed algorithm effectively reduces the random motion in the initial phase of MR, further accelerating the convergence speed and reducing the computing time compared to the CQL. Our method achieves the optimal path length and turning angle in different environments, which demonstrates its practicality, generalizability, and robustness. Compared to the existing methods, the proposed approach can effectively escape from trap points due to local minima and find the optimal path. The limitation is that the computing time is longer compared to the comparison algorithms, but it is within the acceptable range.
To demonstrate the superiority of the proposed algorithm over the CQL algorithm in terms of path length turning angle and computing time. In this paper, the percentage improvement on the CQL algorithm is recorded in Tables 4 and 5. The results demonstrate that the proposed algorithm effectively reduces the random motion in the initial phase of MR, further accelerating the convergence speed and reducing the computing time.
Test results under nine different cases validate that: 1. Our approach has the shortest path length and smallest turning angle of MR for short and safe path planning in all cases, compared with the comparison algorithms in this paper. 2. The proposed approach has the smallest standard deviations among the comparison algorithms in all cases, which demonstrates the superiority and stability of the DMQL algorithm under different environments. Tables 2 and 3, it can be seen that DSQL has improved compared to CQL, but the computing time slightly higher than PSO, GWO, DA and MFO.

From the time results in
4. The t-test results indicate that SSQL outperforms in all the cases because the values are smaller than the level 0.05 of significance.
In Figs 3A and 5A, we test the long-distance obstacle free path planning scheme. In Figs 3 and 5B-5F, we design the local minimum problem, which will cause the mobile robot to easily fall into the local optimum and cannot calculate the effective path. In Figs 3 and 5G-5I, we design the path planning problem under the maze and dense obstacles that may be encountered in the real environment. Through these maps and comparison algorithms, it can be proved that the proposed algorithm has the advantages of robustness, universality and wide application range. The experimental results show that the SSQL algorithm can enable the mobile robot to plan the path with the shortest distance and the smallest turning angle. The sub optimal algorithm is GWO, which performs well in most experiments, but its robustness and stability are not as good as SSQL. The performance of PSO is the worst in most experiments, which is specifically manifested in that the path is far away from the obstacles, resulting in the long path length, and the path is tortuous and the large turning angle. During the experiment, CQL cannot find the global optimal solution due to its slow convergence speed, resulting in the path tortuous, large turning angle and long path length. On the other hand, the biological heuristic algorithm is prone to premature convergence caused by falling into local optimization, while the principle of Q-learning algorithm leads to that the SSQL algorithm can finally find the global optimal solution under the condition that the time is long enough.
During the experiment, the mobile robot traverses the map to obtain the global information, and finds the optimal path (optimal path length and turning angle) in the process of continuous interaction with the environment. These results of path planning for MR show that: Our method achieves the optimal path length and turning angle in different environments, which demonstrates its practicality, generalizability, and robustness. PSO performs blind random search and cannot obtain the global map information, resulting in the inability to

PLOS ONE
calculate the optimal or suboptimal solution, and lead to long computing time. DA introduces the initial population individual optimization mechanism and adds the local rationality judgment mechanism, which makes the algorithm easy to fall into the local minima when calculating the best result. GWO and MFO have achieved excellent results in the comparison algorithm, which is to explore the search space widely, but it still cannot find the global optimal solution like SSQL. Compared to the existing methods, the proposed approach can effectively escape from trap points due to local minima and find the optimal global path. The limitation is that the computing time is longer compared to the comparison algorithms, but it is within the acceptable range.
In the application of mobile robot path planning, different working modes can be adopted according to the size of the mobile robot, the density of obstacles in the environment and the actual needs. When the mobile robot aims to reach the target point as soon as possible, the short path planning mode of SSQL can be used to make the mobile robot reach the target with the shortest path length and turning angle. When the purpose of the mobile robot is to avoid collision with obstacles, the safe path mode can be adopted, and the mobile robot will plan the feasible path away from the obstacles. The combination of the two modes can be better applied in the working environment according to the actual use of mobile robots.

Fig 5. Safe path planning (solution) for MR in different test environments (Each map shows the best path obtained by the SSQL algorithm, S is the starting point and T is the target point).
https://doi.org/10.1371/journal.pone.0275100.g005

Conclusions
In this paper, the SSQL for MR path planning algorithm is proposed, and simulation experiments are conducted according to the short and safe condition of MR in different environments. Two modes are switched for practical work, which can effectively solve the MR path planning problem, saving energy and time. By combining APF with the Q-learning algorithm to initialize the Q-table and setting the static and dynamic reward function to provide the prior knowledge in the environment to the MR, the convergence speed and computation time of CQL are accelerated. To demonstrate the effectiveness and generality of the proposed algorithm, simulation experiments are set up for the maritime environments, fully considering the reefs around the coast in the actual work. It demonstrates the proposed SSQL algorithm can effectively solve the path planning problem for MR, and the mean, standard deviation and t-test of the data analysis show that the proposed algorithm yields the smallest path length and smoothness, which significantly speeds up the computation time and convergence speed of the CQL. The problem of MR falling into local minima and not being able to derive effective paths is avoided. The results show that the SSQL algorithm does not have precocity and can calculate the optimal solution, and it can be further seen that PSO, GWO, DA, and MFO are more prone to precocity in the environment with sparse obstacles.